I will be exploring the Prosper loan data set, a collection of loans made through the fintech company Prosper. Prosper describes itself as a peer-to-peer lending platform (or “marketplace”): according to the “About Us” section of the website, “individuals and institutions can invest in loans and earn attractive returns.”
The dataset includes 81 attributes and just shy of 114,000 records. Many of the attributes are factors or otherwise categorical, and many attributes are missing data or inconsistent between records. To me, this is indicative of underlying changes in Prosper’s application(s) or database(s) over time. Some supporting evidence for that statement is the relationship between origination year and the alpha Prosper rating: before 2009, the attribute was not in use, so all of those records have a null value for this attribute.
Prosper has provided a data dictionary for data available through their API services. I used the definitions to identify several attributes of the dataset to start my investigation.
To correct some deficiencies in the data, I’ll be performing some data wrangling. The first place I’ll start is the Income Range attribute, which isn’t ordered nicely, so let’s restructure it to be more interpretable. I’ll do the same with the Prosper Rating. I’m ordering these so that they show from least to greatest, with null or otherwise empty data as the tail. I’ll sort the borrower state alphabetically.
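# Reorder the income brackets by position within the default (alphabetical) levels.
loans$IncomeRange <- factor(loans$IncomeRange, levels =
  levels(factor(loans$IncomeRange))[c(1,2,4,5,6,3,8,7)])
# Label empty ratings "NA", then reorder the alphabetical levels by position
# so that the rated levels run from HR up to AA.
loans$ProsperRating..Alpha. <- sub("^$", "NA", loans$ProsperRating..Alpha.)
loans$ProsperRating..Alpha. <- factor(loans$ProsperRating..Alpha., levels =
  levels(factor(loans$ProsperRating..Alpha.))[c(8, 7, 6, 5, 4, 3, 1, 2)])
# Sort states alphabetically and move the first (empty) value to the end.
loans$BorrowerState <- factor(loans$BorrowerState, levels =
  sort(unique(loans$BorrowerState))[c(2:52,1)])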
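Reordering levels by position works, but it depends on the alphabetical order R happens to produce, which can differ between locales. Here is a minimal sketch of the same idea with the level names spelled out. The `income` vector is a made-up stand-in for `loans$IncomeRange`, and the bracket labels are my assumption about how they appear in the data.

```r
# Toy stand-in for loans$IncomeRange (hypothetical values for illustration).
income <- c("$100,000+", "$0", "Not displayed", "$25,000-49,999",
            "$1-24,999", "Not employed", "$75,000-99,999", "$50,000-74,999")

# Spell out the desired order: lowest to highest, non-answers at the tail.
income_levels <- c("$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+", "Not employed", "Not displayed")

income <- factor(income, levels = income_levels)

# Tables and plots now follow the declared order.
levels(income)
table(income)
```

Any value that is not listed in `income_levels` would become `NA`, which makes a misspelled or unexpected bracket easy to spot.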
Additionally, I want to replace some missing data and convert data types. Total Prosper Loans is null for any borrower that didn’t have a previous loan, but I’d like to see those as ‘0’. Origination date should be a date type rather than a character, both because it’s useful and because, coming from a database background, of the deep sense of inner purity it leaves me with.
loans$TotalProsperLoans[is.na(loans$TotalProsperLoans)] <- 0
loans$LoanOriginationDate <- as.Date(loans$LoanOriginationDate, "%Y-%m-%d %H:%M:%S")
##
## 0 1 2 3 4 5 6 7 8
## 91853 15538 4540 1447 417 104 29 8 1
We can see that the overwhelming majority of loans on the Prosper platform are the borrower’s first loan on the platform, but I’d be interested to see if the number of repeat customers increases over time.
It looks like people are coming back for loans as time goes on. This question would be a good one for further exploration of the data: is there a meaningful difference between the growth of return clients and new clients?
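One way to start on that question is to count first-time and repeat borrowers per origination year. This is only a sketch on a made-up miniature table: the `toy` data frame and its values are hypothetical, and the `Year` and `Repeat` columns are helper names I introduce here.

```r
# Hypothetical miniature of the loans table; the values are made up.
toy <- data.frame(
  LoanOriginationDate = as.Date(c("2007-03-01", "2007-08-15", "2010-05-20",
                                  "2010-11-02", "2013-01-10", "2013-06-30",
                                  "2013-09-09")),
  TotalProsperLoans   = c(0, 0, 0, 1, 0, 2, 1)
)

toy$Year   <- format(toy$LoanOriginationDate, "%Y")
toy$Repeat <- toy$TotalProsperLoans > 0

# Counts of first-time (FALSE) vs. repeat (TRUE) borrowers per year.
counts <- table(toy$Year, toy$Repeat, dnn = c("Year", "Repeat"))
counts

# Share of each year's loans that went to repeat borrowers.
repeat_share <- prop.table(counts, margin = 1)[, "TRUE"]
round(repeat_share, 2)
```

Comparing the raw counts in `counts` with the shares in `repeat_share` separates overall growth of the platform from growth in returning clients.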
## [1] 640 680 480 800 740 700 820 760 660 620 720 520 780 600 580 540 560
## [18] 500 840 860 NA 460 0 880 440 420 360
## [1] 19 NA
I thought credit score might be an interesting attribute to use in deeper analysis, but after examining the actual values I’m… unimpressed with its usefulness. Every loan is masked by providing a credit score in a 19 point range. In retrospect, that makes sense; it helps keep an individual record from being personally identifying. The default of 30 bins obscures the discrete nature of the scores in the data, but when you adjust the binwidth to something more reasonable (in this case, 5), you can see that the values are masked. Despite this, a shape approximating normal emerges.
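The effect of bin width on masked scores can be sketched with a handful of made-up values. The `lower` and `upper` vectors are hypothetical stand-ins for the lower and upper bounds of the score range, and the wide bin width of 50 is an arbitrary choice for illustration.

```r
# Hypothetical lower bounds, spaced 20 points apart like the unique values above.
lower <- c(640, 680, 680, 700, 700, 700, 720, 720, 740, 660)
upper <- lower + 19   # assumed upper bound of each masked range

unique(upper - lower)  # every record spans the same 19 points

# Wide bins merge neighbouring score bands; narrow bins leave gaps between them.
wide   <- hist(lower, breaks = seq(600, 800, by = 50), plot = FALSE)
narrow <- hist(lower, breaks = seq(600, 800, by = 5),  plot = FALSE)

sum(wide$counts > 0)    # occupied wide bins
sum(narrow$counts > 0)  # occupied narrow bins
length(unique(lower))   # distinct score bands in the toy data
```

With narrow bins, the number of occupied bins equals the number of distinct score bands, which is the discreteness that wider bins hide.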
I was interested in two other, possibly related attributes: Prosper score and Prosper rating. They have similar counts and shapes, roughly normal – when you ignore the fact that there’s a massive pile of people without Prosper ratings. After some digging, it looks like the alpha Prosper rating wasn’t put into use until 2009.
Although garish, this chart demonstrates the split in the dataset: in 2009, Prosper switched to this rating model. The switch suggests an interesting problem for how to handle the data. Should entities prior to 2009 be dropped?
Like an insane Physics teacher that grades on a curve to force the world into their mathematical model, these three attributes have approximately normal distributions. This similarity speaks to something in the way people are treated and classified by major financial institutions, though the ethics of arbitrarily assigning people to a manufactured normal distribution is probably more the subject of a policy piece than an EDA project for a MOOC.
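A quick numeric companion to the chart is the share of loans per year that have no rating. The sketch below uses a made-up `toy` table with real `NA` values (the wrangling code earlier stores the string "NA" instead), so treat the names and numbers as illustrative only.

```r
# Hypothetical miniature of the loans table; the values are made up.
toy <- data.frame(
  Year   = c(2007, 2007, 2008, 2008, 2009, 2010, 2010, 2013),
  Rating = c(NA, NA, NA, NA, "C", "A", "HR", "B")
)

# Share of loans per year with no alpha rating.
missing_share <- tapply(is.na(toy$Rating), toy$Year, mean)
missing_share

# One option (a judgment call, not a recommendation): keep only rated loans.
rated <- toy[!is.na(toy$Rating), ]
nrow(rated)
```

Whether to drop the unrated records depends on the question: they are useless for anything involving the rating, but still informative about amounts, terms, and dates.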
I wanted to see whether previous loans on the platform might affect the internal grading system Prosper uses, but nothing immediately stands out to me.
The borrower APR and Rate have similar distributions – which makes sense, since they’re related attributes of a loan.
##
## 173.71 0 172.76
## 2423 935 536
Monthly loan payments skew toward lower amounts, with a huge spike around $175. A $0 monthly payment is the second most frequent amount, which is anomalous and suggests further scrubbing should be completed. However, I imagine that there’s a reason the Prosper system includes $0 monthly payments at all. It’s possible it indicates something particular about these entities that requires specific domain knowledge to interpret.
Holy cannoli, bat-related superhero! A massively predominant number of loans have a single lender. This feature belies the implicit promise of the Prosper lending platform: ‘crowd-sourced’ borrowing. I’m curious about who the investors are, as Prosper is starting to seem more and more like a traditional financial institution rather than a fresh take on the trillion-dollar wealth transfer.
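A frequency table like the one above can be produced by sorting a table of the payment values. This is a sketch on made-up numbers: the `payment` vector is a hypothetical stand-in for the monthly payment column, and `zero_payment` is a helper name I introduce.

```r
# Hypothetical monthly payments; the values are made up.
payment <- c(173.71, 173.71, 173.71, 173.71, 0, 0, 0,
             172.76, 172.76, 326.62, 98.50)

# Three most common payment amounts, most frequent first.
top <- sort(table(payment), decreasing = TRUE)[1:3]
top

# Flag zero payments for a closer look rather than silently dropping them.
zero_payment <- payment == 0
sum(zero_payment)
```

Keeping the flag as its own logical vector makes it easy to compare the zero-payment loans against the rest before deciding how to treat them.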
Still, it’d be interesting to see what the distribution looks like in more detail.
## [1] 36 60 12
The platform allows 1, 3, and 5 year loans.
I thought about running this chart with a binwidth of 50, but 500 makes a more reasonable graph while still highlighting an important feature of the original loan amounts: they spike around even numbers, and especially the thousands mark. An interesting point to note is that Prosper has a minimum loan amount of $2000.
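The spikes at the thousands mark can also be checked numerically with the modulo operator. The sketch below uses a made-up `amount` vector as a stand-in for the original loan amounts.

```r
# Hypothetical loan amounts; the values are made up.
amount <- c(4000, 10000, 15000, 2500, 3000, 7500, 4000, 6250, 25000, 5000)

# TRUE when the amount falls exactly on a multiple of $1,000.
on_thousand <- amount %% 1000 == 0
mean(on_thousand)      # share of loans on a thousands mark

# Remainders show where the off-thousand amounts tend to land.
table(amount %% 1000)
```

On the real data, a high share here would confirm what the histogram suggests without depending on the choice of binwidth.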
This pairwise comparison chart shows some interesting features of the data that I’ll explore in more depth independently. Important to note are the relationships between APR and Rate, and between OriginalAmount and MonthlyPayment.
Not unexpectedly, a larger loan tends to have a longer term.
Again unsurprisingly, a larger loan typically involves a larger monthly payment. This makes intuitive and anecdotal sense, but it’s good to see it in the data.
We can see, too, that higher earners take larger loans on average, although the variability of the loan amount also tends to increase.
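The two claims here (a higher typical amount and a wider spread) correspond to the median and the interquartile range per income bracket. This is a sketch on a made-up `toy` table with only three brackets; the column names are my assumptions about the dataset.

```r
# Hypothetical miniature of the loans table; the values are made up.
toy <- data.frame(
  IncomeRange = factor(
    rep(c("$1-24,999", "$50,000-74,999", "$100,000+"), each = 3),
    levels = c("$1-24,999", "$50,000-74,999", "$100,000+")
  ),
  LoanOriginalAmount = c(2000, 3000, 4000,
                         4000, 8000, 12000,
                         5000, 15000, 25000)
)

# Typical loan size per bracket.
med <- tapply(toy$LoanOriginalAmount, toy$IncomeRange, median)

# Spread of loan size per bracket (interquartile range).
spread <- tapply(toy$LoanOriginalAmount, toy$IncomeRange, IQR)

rbind(median = med, IQR = spread)
```

Because `IncomeRange` is an ordered-by-level factor, the columns come out from lowest to highest bracket, which makes the trend easy to read.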
APR decreases as Prosper score improves. I understand this tendency to be a way to offset the risk of lending to high-risk borrowers. I’d be interested in exploring whether the assigned rating – and implicitly the APR – affects the rate of default in higher risk populations. I think this would need to be designed as a controlled experiment.
What’s most interesting to me about this chart is not that the median amount seems to level out around $10k, but that the spread of the loan amount does as well. This could be indicative of the increased number of successful borrowers with higher ratings, but it suggests to me that rating isn’t necessarily correlated with the amounts people want to borrow.
A higher income range trends toward a lower APR. Curiously, people without an income have a low APR?
## [1] 640 680 480 800 740 700 820 760 660 620 720 520 780 600 580 540 560
## [18] 500 840 860 NA 460 0 880 440 420 360
This chart was a quick check to confirm my earlier instincts about credit scores. Neither variability nor median appears that different for the credit score range across income categories.
The nearly 1:1 relationship between these two attributes (borrower APR and rate) merely indicates they’re essentially the same feature for analysis.
An interesting feature of this chart is that you can see the frequent loan amounts stand out in the noise. Still, a lower loan amount correlates with a higher APR.
The really interesting part of this chart is that the regression line shows a negative correlation (higher APR for lower monthly payment), but visually you might interpret the opposite.
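When the eye and the regression line disagree, the sign of the fitted slope and of the correlation coefficient settles the direction. The sketch below uses made-up `payment` and `apr` vectors, so the numbers say nothing about the real data.

```r
# Hypothetical monthly payments and APRs; the values are made up.
payment <- c(50, 100, 150, 200, 300, 400, 600, 800)
apr     <- c(0.30, 0.28, 0.29, 0.25, 0.22, 0.24, 0.18, 0.15)

# Simple linear regression of APR on monthly payment.
fit   <- lm(apr ~ payment)
slope <- coef(fit)[["payment"]]
slope

# The Pearson correlation always has the same sign as that slope.
r <- cor(payment, apr)
r
```

A dense cloud of overplotted points can dominate the visual impression while contributing little to the slope, which is one reason to check the number.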
My analysis/charts show some fairly standard features of the data that you might expect from financial data: higher incomes tend to lead to larger loans and better scores; larger loans tend to lead to better rates; and higher incomes tend to have better rates. You should expect this kind of interpretation from financial data since people with more income are generally favored by our financial systems over the disenfranchised.
Adding term to the pairwise comparison reveals some interesting features that I’ll explore in depth.
To me, this chart illuminates that term doesn’t really impact APR or Origination Amount: no color is really distinct in the chart.
While not a color-blind friendly graph, this chart shows some clusters in the data. The cooler colors are higher income, and tend toward larger loans and lower APRs. The most prominent color is in the $50-75k range, chilling around the regression line.
The three distinct clusters of term are very interesting to me. There’s definitely a strong relationship between the term, original amount, and the loan payment. Obviously, the monthly payment increases with the loan amount, but the specific ratio between the two clearly varies by the specific term selected. This is pretty obvious if you’ve ever taken out a loan, but still fascinating to see in the data.
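The three branches are what the standard fixed-rate amortization formula would produce. Using that formula is my assumption about how the payments are computed; nothing in the data confirms it. The function name `monthly_payment` and the 20% rate are hypothetical choices for illustration.

```r
# Standard fixed-rate amortization formula (an assumption, see above).
# Not defined for a rate of exactly zero.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12                       # monthly interest rate
  principal * r / (1 - (1 + r)^(-term_months))
}

# Payment per dollar borrowed for each term, at the same illustrative rate.
terms <- c(12, 36, 60)
ratio <- sapply(terms, function(n) monthly_payment(1, 0.20, n))
names(ratio) <- terms
round(ratio, 4)

# Example: a $1,000 loan over 12 months at that rate.
monthly_payment(1000, 0.20, 12)
```

For a fixed rate, the payment is a constant multiple of the amount, and that multiple shrinks as the term grows: one straight line per term. Differences in APR between loans would then account for the scatter within each branch.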
The same branches of payment to loan amount appear here, but there don’t seem to be clear clusters. There is a gradual increase of loan amount as income increases, something we learned earlier.
Unlike total loan amount, debt to income ratio doesn’t stand out as much for the selected term.
This is a really fascinating chart to me. Several groupings stand out pretty phenomenally. There’s a sharply distinct tendency for higher income to have a lower debt to income ratio. The extreme cases of debt to income ratio are among lower income brackets or those unemployed.
Using term and income brackets was productive in pushing some of the correlations discovered in the data. The categorical attributes showed a few interesting features of the data. The first was how clearly term indicated which of three relationships between loan amount and monthly payment applied. The second, for me, was the relationship between income and debt to income ratio.
This chart demonstrates the distribution of the Prosper rating score as a proportion of the total loans in a given year. This chart tells us several important and suggestive features of the data. First, Prosper did not begin using this rating system until 2009, a year with a comparative dearth of loans on the platform. What this suggests to me, with only general contextual knowledge, is that Prosper re-evaluated its rating system in the wake of the financial crisis and the Great Recession – although I should be careful to note that you cannot conclude that from this chart.
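The proportions behind a chart like this can be computed with a row-normalized contingency table. This is a sketch on a made-up `toy` table; the column names and values are hypothetical.

```r
# Hypothetical miniature of the loans table; the values are made up.
toy <- data.frame(
  Year   = c(2010, 2010, 2010, 2010, 2013, 2013, 2013, 2013, 2013),
  Rating = factor(c("HR", "C", "C", "A", "E", "C", "C", "B", "AA"),
                  levels = c("HR", "E", "D", "C", "B", "A", "AA"))
)

# Each row (year) is scaled to sum to 1, so cells are shares of that year's loans.
share <- prop.table(table(toy$Year, toy$Rating), margin = 1)
round(share, 2)

# Sanity check: every year's shares add up to 1.
rowSums(share)
```

Normalizing within each year is what lets a low-volume year such as 2009 be compared with later years on the same scale.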
Secondly, using 2013 as an example year, the distribution of ratings across the loans looks roughly normal. This speaks to how a credit rating is intended to be a standardized measure to evaluate a borrower. It’s a rating to categorize and label a person based on the artificial measurement assigned and created by the investing class. You could speculate that toward the lower end of the scale more loans are denied, toward the higher fewer are needed; regardless, what is clear is credit score isn’t necessarily a disqualifying measure to receive a loan. However, one might presume that it influences other attributes of the loan.
As we can clearly see in this chart, people with higher credit ratings tend toward lower APRs. There are numerous factors influencing this relationship: the attributes of a loan determine the APR, generally according to the formula and variables of the lending financial institution. The best takeaway of this relationship, in my mind, is not that a higher Prosper rating leads to a lower APR, but rather that people who have higher Prosper ratings tend to select loans with a lower APR.
This plot shows a positive correlation between APR and estimated yield: that is, as the APR on a loan increases, so does the investor’s return. To state that again, a lower Prosper rating tends toward a higher yield for investors. You can see in the color scale that higher ratings are clustered together, giving way to lower ratings as the APR rises. Read with the earlier two charts, this seems a fairly obvious conclusion to draw: higher rated borrowers tend to choose (or be approved for) a lower APR, and the distribution of ratings tends to be roughly normal. The amount of the loan doesn’t seem to be indicative of any of the other factors in the plot.
This dataset left me with a lot of questions, some from a data architecture perspective and some from a more sociopolitical one. Prosper has clearly changed several features of their platform over the past decade, the primary one being how borrowers are graded. This dataset was scrubbed so that it cannot be personally identifying, which is an important analytical standard for open datasets. However, it would be fascinating to know which demographic details could contextualize the dataset. For instance, who are the repeat borrowers? What other attributes in the data have changed over time, and how should that impact wrangling the data for more sophisticated analysis or engineering?
The other direction of questions I have is related to the justness of Prosper as a lending platform. Are the fees and rating systems and various mechanisms surrounding the financial sector damaging to those who might take out loans? A more tailored analysis might be able to tease out the details of how the various attributes of a loan relate to the chance of default – and how that affects a person’s Prosper rating, and potentially their credit score, all while highlighting the vicious cycle which likely doesn’t expose the investors to much risk.